Autoset nprocs and improve max_blocks estimate #949
Conversation
This looks pretty good. A couple thoughts.
The modifications I'm working on handle the nprocs thing in much the same way, but also refactor the max_blocks calculation so each proc (MPI task) has a locally defined max_blocks that matches the decomposition exactly. I'm still developing and testing to make sure it'll work properly. I'm happy to merge this as a first step (if the CESMCOUPLED thing is fixed). I could then update the implementation if I make additional progress. Thoughts?
Let's go with this. I think I made the related changes to do this. I guess we should add some tests?
Thanks Anton.
This change allows nprocs to be set to -1 in 'ice_in' and then the number of processors will be automatically detected.
This change re-orders the calculation of max_blocks to ensure nprocs is set before the calculation.
This description reflects the latest state of the PR, super.
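A minimal sketch of the ordering described above, assuming get_num_procs() returns the MPI communicator size; the program, variable names, and the stand-in function below are illustrative only, not the actual ice_domain code:

   program nprocs_order_sketch
      ! Hedged illustration: resolve nprocs first, then estimate max_blocks from it.
      use mpi
      implicit none
      integer :: ierr, my_rank, nprocs

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

      nprocs = -1                               ! as if read from ice_in
      if (nprocs < 0) nprocs = get_num_procs()  ! autodetect the number of MPI tasks

      ! only after this point would max_blocks be estimated, using the resolved nprocs

      if (my_rank == 0) print *, 'nprocs resolved to ', nprocs
      call MPI_Finalize(ierr)

   contains

      integer function get_num_procs()
         ! stand-in for CICE's query of the MPI communicator size
         integer :: ierr_loc
         call MPI_Comm_size(MPI_COMM_WORLD, get_num_procs, ierr_loc)
      end function get_num_procs

   end program nprocs_order_sketch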
@@ -242,16 +257,6 @@ subroutine init_domain_blocks
         !*** domain size zero or negative
         !***
         call abort_ice(subname//' ERROR: Invalid domain: size < 1', file=__FILE__, line=__LINE__) ! no domain
      else if (nprocs /= get_num_procs()) then
         !***
         !*** input nprocs does not match system (eg MPI) request
We could keep this comment, I think.
Isn't this clear from the ERROR message?
Can we avoid whitespace-only lines here? Also, I would think a single blank line is sufficient.
I made it one line instead of two :)
@@ -370,6 +370,7 @@ domain_nml
   "``maskhalo_bound``", "logical", "mask unused halo cells for boundary updates", "``.false.``"
   "``max_blocks``", "integer", "maximum number of blocks per MPI task for memory allocation", "-1"
   "``nprocs``", "integer", "number of processors to use", "-1"
   "", "``-1``", "find number of processors automatically", ""
I think while we're in here, "MPI ranks" would be more accurate than "processors" (for both the new line and the existing one).
We seem to use "processors" throughout the rest of the docs, although they seem light on details about MPI in general.
I'm definitely out of my depth here, but Google implies the rank is just the unique name for each process? So this should be the number of "processes", as it could be possible to have multiple processes per processor! (And by this logic, we should change all the other "processors" to "processes" in the docs.)
I think "MPI tasks" is technically correct here. Processors and processes are both wrong. "MPI ranks" is probably the same thing, but not terminology we generally use in the document. Lets go with "MPI tasks"?
Is "Number of MPI tasks (typically 1 per processor) to use" good?
There are lots of uses of "processor" in the docs, I guess these could be changed to task? Should I make a new issue?
"typically 1 per processor" would not be a good thing to add. The way I think about this is that processors are the hardware in use, while tasks and threads are how those processors are used. So nprocs is specifically associated with the number of MPI tasks only. It has nothing to do with threads. If you are running 16 MPI tasks threaded 2 ways each, you are using 32 processors and have 16 MPI tasks. nprocs is MPI tasks and has nothing to do with the "processors" or "processes" used/set. To me, processes is a word separate from hardware. If you could oversubscribe threads on a processor, you might have 1 processor but 4 processes on that processor.
Again, this is how I think about things and probably how most of the documentation uses that terminology. I fully acknowledge that my ideas about this may not be the norm. In that case, we should update the documentation using more standard language. But before we do, we should carefully think about the terminology. At this point, I would just update this one line of documentation unless you see other glaring errors. We could open an issue, but before we do, I'd like someone to review the documentation (at least a few sections here and there) to see whether there seems to be a bunch of incorrect terminology. I prefer to have an issue that says "several things need to be fixed, such as ..." rather than "review documentation for correct/consistent use of tasks, threads, processors, processes".
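To illustrate the distinction, here is a standalone sketch (not CICE code) that assumes an MPI library and OpenMP are available; launched as, say, 16 tasks with OMP_NUM_THREADS=2, it reports 32 processors in use, matching the example above:

   program tasks_vs_threads
      ! Each MPI rank is one task; OpenMP threads subdivide the work of a task.
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, ntasks, my_rank, nthreads

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)    ! number of MPI tasks
      call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
      nthreads = omp_get_max_threads()                    ! threads available per task

      if (my_rank == 0) then
         print *, 'MPI tasks        : ', ntasks
         print *, 'threads per task : ', nthreads
         ! assuming one thread per core and no oversubscription,
         ! this product is the number of processors (cores) in use
         print *, 'processors in use: ', ntasks*nthreads
      endif

      call MPI_Finalize(ierr)
   end program tasks_vs_threads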
Thanks Tony. I'll just make it say "Number of MPI tasks". I am a bit out of my depth re the best terminology for the documentation, so I'll leave it for now.
I think you still need to update the documentation here. Will start some testing on the changes.
I made the update. I wonder if it no longer makes sense when compiling with the 'serial' folder instead of the 'mpi' folder?
#ifdef CESMCOUPLED
      nprocs = get_num_procs()
#else
      if (nprocs == -1) then
Let's make this (nprocs < 0)?
Ok done
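For reference, a sketch of how the block above presumably reads after this change; the body of the branch and the closing directives are inferred from the PR description rather than copied from the diff:

#ifdef CESMCOUPLED
      nprocs = get_num_procs()
#else
      if (nprocs < 0) then
         nprocs = get_num_procs()   ! presumed body: autodetect the MPI task count
      endif
#endif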
I will run a full test suite once comments are addressed and the PR is no longer draft. Thanks!
My question here is whether we should add some test cases for
Co-authored-by: anton-seaice <[email protected]>
Force-pushed from ab0dcb3 to 56304b7
Please run a test manually with nprocs=-1 to make sure it works. I think the question is maybe whether we should have nprocs=-1 be the default setup for all tests. You could try that and see if everything runs and is bit-for-bit. I think you'd need to set nprocs=-1 in configuration/scripts/ice_in and then remove "nprocs = ${task}" in cice.setup. Is that the direction we want to go?
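For a manual check along those lines, the relevant ice_in entries would presumably look something like this; only the two settings discussed here are shown, with names taken from the domain_nml table edited above:

&domain_nml
    nprocs     = -1    ! autodetect the number of MPI tasks
    max_blocks = -1    ! let the model estimate max_blocks
/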
Our intermittent failure is back! https://github.com/CICE-Consortium/CICE/actions/runs/9024916341/job/24799676376 failed, but they are the same commit.
I restarted the failed GitHub Action and it passed. Ugh....
Full test suite run on derecho with intel, all passed.
PR checklist
Short (1 sentence) summary of your PR:
Contributes to "Can we remove nprocs from ice_in?" #945. This improves setting nprocs and max_blocks automatically.
Developer(s):
@anton-seaice @minghangli-uni
Suggest PR reviewers from list in the column to the right.
@apcraig
Please copy the PR test results link or provide a summary of testing completed below.
Needs doing
How much do the PR code changes differ from the unmodified code?
Does this PR create or have dependencies on Icepack or any other models?
Does this PR update the Icepack submodule? If so, the Icepack submodule must point to a hash on Icepack's main branch.
Does this PR add any new test cases?
Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
Please document the changes in detail, including why the changes are made. This will become part of the PR commit log.
This change allows nprocs to be set to -1 in 'ice_in' and then the number of processors will be automatically detected.
This change improves the automatic calculation of max_blocks to give a better (but still not foolproof) estimate of max_blocks if it is not set in ice_in.
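As a rough illustration of the kind of estimate meant here, a hypothetical ceiling-division formula with assumed variable names, not the exact expression in the code:

   integer :: nblocks_tot, max_blocks
   ! blocks needed to cover the global grid, spread as evenly as possible
   ! across the MPI tasks and rounded up
   nblocks_tot = ((nx_global-1)/block_size_x + 1) * ((ny_global-1)/block_size_y + 1)
   max_blocks  = (nblocks_tot - 1)/nprocs + 1   ! ceiling(nblocks_tot / nprocs)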